Abstract: This paper introduces an objective function that seeks to minimise 
the average total number of bits required to encode the joint state of all of the 
layers of a Markov source. This type of encoder may be applied to the problem of 
optimising the bottom-up (recognition model) and top-down (generative model) 
connections in a multilayer neural network, and it unifies several previous results 
on the optimisation of multilayer neural networks. 

1 Introduction 

There is currently a great deal of interest in modelling probability density
functions (PDFs). This research is motivated by the fact that the joint PDF
of a set of variables can be used to deduce any conditional PDF which involves
these variables alone, which thus allows all inference problems in the
space of these variables to be addressed quantitatively. The only limitation
of this approach to solving inference problems is that a model of the PDF is
used, rather than the actual PDF itself, which can lead to inaccurate inferences.
The objective function for optimising a PDF model is usually to maximise
the log-likelihood that it could generate the training set: i.e. maximise
$\langle \log(\text{model probability}) \rangle_{\text{training set}}$.

In this paper the problem of modelling the PDF of a Markov source will be 
studied. In the language of neural networks, this type of source can be viewed as 
a layered network, in which the state of each layer directly influences the states 
of only the layers immediately above and below it. The optimal PDF model 
then approximates the joint PDF of the states of all of the layers of the network, 
or at least some subset of the layers of the network, which is a generalisation of 
what is conventionally done in neural network PDF models. 

Markov source density modelling is interesting because it unifies a number 
of existing neural network techniques into a single framework, and may also 
be viewed as a density modelling perspective on the results reported in [Tfi] . 
For instance, an approximation to the standard Kohonen topographic mapping
neural network [3] emerges from density modelling the joint PDF of the input
and output layers of a 3-layer network, and generalisations of the Kohonen
network also emerge naturally from this framework.

* Submitted to Neural Computation on 2 February 1998. Manuscript no. 1729. It was not
accepted for publication, but it underpins several subsequently published papers.

In section 2 the relevant parts of the Shannon theory of information are
summarised [12], and the application to coding various types of source is derived
[11]: in particular, Markov sources are discussed, because they are the key to the
approach that is presented in this paper. In section 3 the application of Markov
source coding to unsupervised neural networks is discussed in detail, including
the Kohonen network [3]. In section 4 (and appendix A) hierarchical encoding
using an adaptive cluster expansion (ACE) is discussed fHj, and in section 5 (and
appendix B) factorial encoding using a partitioned mixture distribution (PMD)
is discussed [Hj. Finally in appendix C density modelling of Markov sources is
compared with standard density modelling using a Helmholtz machine [2].

2 Coding Theory 

In section 2.1 the basic ideas of information theory are outlined (this discussion
is inspired by the reasoning presented in [12]), and in section 2.2 the process
of using a model to code a source is described in detail. In section 2.3 this is
extended to the case of a Markov source. In section 2.4 the relationship between
conventional density models and Markov density models is discussed.

See [12] for a lucid introduction to information theory, and see [11] for a
discussion of the number of bits required to encode a source using a model.

2.1 Information Theory 

A source of symbols (drawn from an alphabet of M distinct symbols) is modelled 
as a vector of probabilities denoted as P 

$$P = (P_1, P_2, \cdots, P_M) \qquad (1)$$

which describes the relative frequency with which each symbol is drawn independently
from the source P. A trivial example is an unbiased die, which has
$M = 6$ and $P_i = \frac{1}{6}$ for $i = 1, 2, \cdots, 6$.

The ordered sequence of symbols drawn independently from a source may
be partitioned into subsequences of N symbols, and each such subsequence will
be called a message. If N is very large, then a message is called "likely" if the
relative frequency of occurrence of its symbols approximates P, otherwise it is
called "unlikely". As $N \to \infty$ the set of likely messages is very sharply defined,
in the sense that the proportion of all messages that lie in the transition region
between being likely and being unlikely becomes vanishingly small. Thus there
is a set of likely messages all with equal probability of occurring (because each
likely message has the same relative frequency of occurrence of each of the M
possible symbols), and a set of unlikely messages (i.e. all the messages that
are not likely messages) that have essentially zero probability of occurring. It
is this separation of messages into a likely set (all with equal probability) and
an unlikely set (all with zero probability) that underlies information theory, as
discussed in [12].

A likely message from P will be called a likely P-message. As $N \to \infty$
the number of times that each symbol i occurs in a P-message of length N
is $n_i = N P_i$, where $\sum_{i=1}^{M} P_i = 1$ guarantees that the normalisation condition
$\sum_{i=1}^{M} n_i = N$ is satisfied. The logarithm of the number of different likely
P-messages is given by (using Stirling's approximation $\log x! \approx x \log x - x$ when x
is large)

$$\log\left(\frac{N!}{n_1!\, n_2! \cdots n_M!}\right) \approx -N \sum_{i=1}^{M} P_i \log P_i \qquad (2)$$

Now define the entropy H (P) of source P as the logarithm of the number
of different likely P-messages (measured per message symbol):

$$H(P) \equiv -\sum_{i=1}^{M} P_i \log P_i \geq 0 \qquad (3)$$

Thus H (P) is the number of bits per symbol (on average) required to encode
the source (assuming a perfect encoder), because the only messages that the
source has a finite probability of producing are the likely P-messages that are
enumerated in equation 2.

It is usually very difficult to encode the source P using H (P) bits per symbol 
on average. This is because although the boundary between the set of likely P- 
messages and the set of unlikely P-messages is sharply defined in principle, 
in practice it is very hard to model mathematically. If this boundary is not 
precisely defined, then it is impossible to compute the value of H (P) accurately. 
In order to ensure that all of the likely P-messages are accounted for, it is 
necessary for the mathematical model of the boundary to lie outside the true 
boundary, which thus overestimates the value of H (P). This demonstrates that 
H (P) is in fact a lower bound on the true number of bits per symbol that must 
be used to encode the source P. 
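
The counting argument behind equations 2 and 3 can be checked numerically. The following Python sketch compares the exact per-symbol logarithm of the number of likely messages with the entropy for the unbiased die. The function names and the use of base-2 logarithms (so that the result is in bits) are illustrative assumptions rather than part of the original derivation.

```python
import math


def entropy_bits(p):
    """Entropy H(P) in bits per symbol (equation 3, base-2 logarithm assumed)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0.0)


def log2_likely_messages_per_symbol(p, n):
    """Exact log2 of the multinomial count N!/(n_1! ... n_M!), divided by N.

    Assumes n * p_i is (close to) an integer for every symbol i.
    """
    counts = [round(n * pi) for pi in p]
    total = sum(counts)
    # lgamma(k + 1) = log(k!), which avoids forming huge factorials.
    log_count = math.lgamma(total + 1) - sum(math.lgamma(c + 1) for c in counts)
    return log_count / (total * math.log(2.0))


die = [1.0 / 6.0] * 6
for n in (60, 600, 60000):
    # The exact count approaches H(P) from below as the message length grows.
    print(n, log2_likely_messages_per_symbol(die, n), entropy_bits(die))
```

The gap between the two columns shrinks as N grows, which is the sense in which the Stirling approximation in equation 2 becomes exact per symbol.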

2.2 Source Coding 

The mathematical model (or, simply, the model) of the boundary between the 
set of likely P-messages and the set of unlikely P-messages may be derived 
from another vector of probabilities, denoted as Q, whose M elements model
the probability of each symbol drawn from an alphabet of M distinct symbols.
If Q = P then the boundary is modelled perfectly, and hence in principle the
lower bound H (P) on the number of bits per symbol may be attained, although
even this is usually difficult to realise constructively in practice. In practical
situations $Q \neq P$ is invariably the case, so the problem of coding a source with
an inaccurate model cannot be avoided.

Since the only P-messages that can occur are the likely P-messages (which
all occur with equal probability), the number of bits required when using Q
to encode P is (minus) the logarithm $\Pi_N(P, Q)$ of the probability that a
Q-message is one of the likely P-messages. $\Pi_N(P, Q)$ is given by

$$\Pi_N(P, Q) = \log\left(\frac{N!}{n_1!\, n_2! \cdots n_M!}\, Q_1^{n_1} Q_2^{n_2} \cdots Q_M^{n_M}\right) \approx -N \sum_{i=1}^{M} P_i \log\frac{P_i}{Q_i} \leq 0 \qquad (4)$$

which is negative because the model Q generates likely P-messages with less 
than unit probability. The model Q must be used to generate enough Q- 
messages to ensure that all of the likely P-messages are reproduced, which 
requires the basic H (P) bits per symbol (that would be required if Q = P), 
plus some extra bits to compensate for the less than 100% efficiency with which 
Q generates likely P-messages (because $Q \neq P$). The number of extra bits per
symbol is the relative entropy G (P, Q)

$$G(P, Q) \equiv \sum_{i=1}^{M} P_i \log\frac{P_i}{Q_i} \geq 0 \qquad (5)$$

which is $-\frac{\Pi_N(P, Q)}{N}$, or minus the logarithm of the probability per symbol that a
Q-message is a likely P-message. Thus Q is used to generate exactly the number
of extra Q-messages required to compensate for the fact that the probability that
each Q-message is a likely P-message is less than unity (i.e. $\Pi_N(P, Q) < 0$).
G (P, Q) (i.e. relative entropy) is the amount by which the number of bits per 
symbol exceeds the lower bound H (P) (i.e. source entropy). For completeness, 
also define the total number of bits per symbol H (P) + G (P, Q) as L (P, Q), 
which is given by 

$$L(P, Q) \equiv H(P) + G(P, Q) = -\sum_{i=1}^{M} P_i \log Q_i \geq 0 \qquad (6)$$

The expression for G (P, Q) provides a means of optimising the model Q. If
the optimisation criterion is that the average number of bits per symbol required
when using Q to encode P should be minimised, then the optimum model $Q_{opt}$
should minimise the objective function G (P, Q) with respect to Q, thus

$$Q_{opt} = \arg\min_{Q} G(P, Q) \qquad (7)$$

This criterion for optimising a model does not include the number of bits required
to specify the model itself, such as is used in the minimum description
length approach [11], although the objective function could be extended to include
such additional contributions.

G(P,Q) is frequently used as an objective function in density modelling,
where the source P is the vector of observed symbol frequencies. Since $Q_{opt}$
must, in some sense, be close to P, this affords a practical way of ensuring
that the optimum model probabilities $Q_{opt}$ are similar to the source symbol
frequencies P, which is the goal of density modelling.
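
As a small worked illustration of equations 3, 5 and 6, the sketch below evaluates H(P), G(P, Q) and L(P, Q) for a four-symbol source and a uniform model. The particular source, the function names and the base-2 logarithm are assumptions chosen so that the arithmetic is easy to follow by hand.

```python
import math


def entropy(p):
    """H(P): lower bound on the bits per symbol (equation 3)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0.0)


def relative_entropy(p, q):
    """G(P, Q): extra bits per symbol caused by using model Q (equation 5)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def total_bits(p, q):
    """L(P, Q) = H(P) + G(P, Q) = -sum_i P_i log2 Q_i (equation 6)."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0.0)


source = [0.5, 0.25, 0.125, 0.125]   # hypothetical source P
uniform_model = [0.25] * 4           # hypothetical (inaccurate) model Q

print(entropy(source))                           # bits needed with a perfect model
print(relative_entropy(source, uniform_model))   # penalty for using the uniform model
print(total_bits(source, uniform_model))         # their sum
print(relative_entropy(source, source))          # no penalty when Q = P
```

With these numbers the uniform model costs 2 bits per symbol, of which 1.75 bits are the source entropy and 0.25 bits are the relative entropy; the penalty vanishes when Q = P, in line with equation 7.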

2.3 Markov Source Coding 

The above scheme for using a model Q to encode symbols derived from a source
P may be extended to the case where the source and the model are (L + 1)-layer
(layers 0 to L) first order Markov chains. The word "layer" is used in anticipation
of the connection with multilayer neural networks that will be discussed in
section 3. Thus split up both of P and Q into their constituent transition
probabilities

$$\begin{aligned}
P &= (P^0, P^{1|0}, \cdots, P^{L-1|L-2}, P^{L|L-1}) \\
  &= (P^{0|1}, P^{1|2}, \cdots, P^{L-1|L}, P^L) \\
Q &= (Q^0, Q^{1|0}, \cdots, Q^{L-1|L-2}, Q^{L|L-1}) \\
  &= (Q^{0|1}, Q^{1|2}, \cdots, Q^{L-1|L}, Q^L)
\end{aligned} \qquad (8)$$

These two ways of decomposing P (and Q) are equivalent, because a forward
pass through a Markov chain may be converted into a backward pass through a
different Markov chain, whose transition probabilities are uniquely determined
by applying Bayes' theorem to the original Markov chain. $P^{k|l}$ (and $Q^{k|l}$) is the
matrix of transition probabilities from layer l to layer k of the Markov chain of
the source (and model), $P^0$ (and $Q^0$) is the vector of marginal probabilities in layer
0, and $P^L$ (and $Q^L$) is the vector of marginal probabilities in layer L. This may be
written out in detail as

$$\begin{aligned}
P^0_{i_0} &= \text{true probability that layer 0 has state } i_0 \\
P^L_{i_L} &= \text{true probability that layer } L \text{ has state } i_L \\
P^{l+1|l}_{i_{l+1},i_l} &= \text{true probability that layer } l+1 \text{ has state } i_{l+1} \text{ given that layer } l \text{ has state } i_l \\
P^{l|l+1}_{i_l,i_{l+1}} &= \text{true probability that layer } l \text{ has state } i_l \text{ given that layer } l+1 \text{ has state } i_{l+1}
\end{aligned} \qquad (9)$$

$$\begin{aligned}
Q^0_{i_0} &= \text{model probability that layer 0 has state } i_0 \\
Q^L_{i_L} &= \text{model probability that layer } L \text{ has state } i_L \\
Q^{l+1|l}_{i_{l+1},i_l} &= \text{model probability that layer } l+1 \text{ has state } i_{l+1} \text{ given that layer } l \text{ has state } i_l \\
Q^{l|l+1}_{i_l,i_{l+1}} &= \text{model probability that layer } l \text{ has state } i_l \text{ given that layer } l+1 \text{ has state } i_{l+1}
\end{aligned} \qquad (10)$$
The number of extra bits per symbol G (P, Q) (see equation 5) required to
encode each symbol from the source P using the model Q may then be written
as

$$\begin{aligned}
G(P, Q) &= \sum_{i_0=1}^{M_0} \cdots \sum_{i_L=1}^{M_L} P^{0|1}_{i_0,i_1} P^{1|2}_{i_1,i_2} \cdots P^{L-1|L}_{i_{L-1},i_L} P^L_{i_L} \log\left(\frac{P^{0|1}_{i_0,i_1} P^{1|2}_{i_1,i_2} \cdots P^{L-1|L}_{i_{L-1},i_L} P^L_{i_L}}{Q^{0|1}_{i_0,i_1} Q^{1|2}_{i_1,i_2} \cdots Q^{L-1|L}_{i_{L-1},i_L} Q^L_{i_L}}\right) \\
&= \sum_{l=0}^{L-1} \sum_{i_{l+1}=1}^{M_{l+1}} P^{l+1}_{i_{l+1}} G_{i_{l+1}}(P^{l|l+1}, Q^{l|l+1}) + G(P^L, Q^L)
\end{aligned} \qquad (11)$$

where the flow of influence in both P and Q is from layer L to layer 0 (each
layer l is conditioned on layer l + 1). The suffix $i_{l+1}$ that appears on
$G_{i_{l+1}}(P^{l|l+1}, Q^{l|l+1})$ indicates that the state $i_{l+1}$ of
layer l + 1 is fixed during the evaluation of $G_{i_{l+1}}(P^{l|l+1}, Q^{l|l+1})$ (i.e. it is the
relative entropy of layer l, given that the state of layer l + 1 is known). Similarly,
the total number of bits per symbol required to encode each symbol from the
source P using the model Q is L (P, Q) (i.e. H (P) + G (P, Q)), which is given
by

$$L(P, Q) = \sum_{l=0}^{L-1} \sum_{i_{l+1}=1}^{M_{l+1}} P^{l+1}_{i_{l+1}} L_{i_{l+1}}(P^{l|l+1}, Q^{l|l+1}) + L(P^L, Q^L) \qquad (12)$$

This result has a very natural interpretation. Both the source P and the
model Q are Markov chains, and corresponding parts of the model are matched
up with corresponding parts of the source. First of all, the number of bits
required to encode the $L$th layer of the source is $L(P^L, Q^L)$. Having done that,
the number of bits required to encode the $(L-1)$th layer of the source, given
that the state $i_L$ of the $L$th layer is already known, is $L_{i_L}(P^{L-1|L}, Q^{L-1|L})$, which
must then be averaged over the alternative possible states of the $L$th layer to
yield $\sum_{i_L=1}^{M_L} P^L_{i_L} L_{i_L}(P^{L-1|L}, Q^{L-1|L})$. This process is then repeated to encode
the $(L-2)$th layer of the source, given that the state of the $(L-1)$th layer is
already known, and so on back to layer 0. This yields precisely the expression
for L (P, Q) given above.

Bayes' theorem (in the form $P^{l|l+1}_{i_l,i_{l+1}} P^{l+1}_{i_{l+1}} = P^{l+1|l}_{i_{l+1},i_l} P^l_{i_l}$) may be used to
rewrite the expression for L (P, Q) so that the flow of influence in P and Q
runs in opposite directions. Thus

$$L(P, Q) = \sum_{l=0}^{L-1} \sum_{i_l=1}^{M_l} P^l_{i_l} K_{i_l}(P^{l+1|l}, Q^{l|l+1}) + L(P^L, Q^L) \qquad (13)$$

where $K_{i_l}(P^{l+1|l}, Q^{l|l+1})$ is defined as

$$K_{i_l}(P^{l+1|l}, Q^{l|l+1}) \equiv -\sum_{i_{l+1}=1}^{M_{l+1}} P^{l+1|l}_{i_{l+1},i_l} \log Q^{l|l+1}_{i_l,i_{l+1}} \qquad (14)$$

The expression for L (P, Q) in equation 13 has an analogous interpretation to
that in equation 12.
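
The equivalence of the two forms of L (P, Q) (equations 12 and 13) can be confirmed numerically. The sketch below builds a random 3-layer source from forward transitions, uses Bayes' theorem to obtain its backward transitions, and evaluates both expressions against a random top-down model. The layer sizes, the random initialisation and all function names are hypothetical choices made only for this check.

```python
import math
import random


def normalise(v):
    s = sum(v)
    return [x / s for x in v]


def random_conditional(n_out, n_given, rng):
    """Return cond[a][b] = probability of state a given state b."""
    cols = [normalise([rng.random() + 0.1 for _ in range(n_out)])
            for _ in range(n_given)]
    return [[cols[b][a] for b in range(n_given)] for a in range(n_out)]


def markov_bits(p0, forward, q_backward, q_top):
    """Evaluate L(P, Q) using equation 12 and equation 13.

    forward[k][j][i]    = P^{k+1|k}: source, state j above given state i below
    q_backward[k][i][j] = Q^{k|k+1}: model, state i below given state j above
    """
    marginals = [p0]
    for t in forward:
        prev = marginals[-1]
        marginals.append([sum(t[j][i] * prev[i] for i in range(len(prev)))
                          for j in range(len(t))])
    top = -sum(p * math.log2(q) for p, q in zip(marginals[-1], q_top))
    eq12, eq13 = top, top
    for layer, t in enumerate(forward):
        p_l, p_next, q = marginals[layer], marginals[layer + 1], q_backward[layer]
        for j in range(len(p_next)):
            # Bayes' theorem gives the backward source transition P^{l|l+1}.
            back = [t[j][i] * p_l[i] / p_next[j] for i in range(len(p_l))]
            eq12 -= p_next[j] * sum(back[i] * math.log2(q[i][j])
                                    for i in range(len(p_l)))
        for i in range(len(p_l)):
            eq13 -= p_l[i] * sum(t[j][i] * math.log2(q[i][j])
                                 for j in range(len(p_next)))
    return eq12, eq13


rng = random.Random(0)
sizes = [4, 3, 2]  # number of states in layers 0, 1, 2
p0 = normalise([rng.random() + 0.1 for _ in range(sizes[0])])
forward = [random_conditional(sizes[k + 1], sizes[k], rng) for k in range(2)]
q_backward = [random_conditional(sizes[k], sizes[k + 1], rng) for k in range(2)]
q_top = normalise([rng.random() + 0.1 for _ in range(sizes[2])])

eq12, eq13 = markov_bits(p0, forward, q_backward, q_top)
print(eq12, eq13)  # the two values agree up to rounding error
```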

Other types of Markov chain may also be considered, such as ones in which 
some of the layers are not included in the calculation of the number of bits 
required to encode the source. One such example is discussed in section 3.4.

2.4 Alternative Viewpoints 

The relationship between conventional density models and Markov density mod- 
els can be stated from the point of view of a conventional density modeller. The 
goal is to build a density model $Q^0$ of the source $P^0$, such that the number
of bits per symbol $L(P^0, Q^0)$ required to encode $P^0$ is minimised. However,
if the source $P^0$ is transformed through L layers of a network to produce a
transformed source $P^L$, then $L(P^0, Q^0) \leq L(P, Q)$ where L (P, Q) is given
in equation 12, which is the sum of the number of bits per symbol $L(P^L, Q^L)$
required to encode $P^L$, plus (for $l = 0, 1, \cdots, L-1$) the number of bits per
symbol $\sum_{i_{l+1}=1}^{M_{l+1}} P^{l+1}_{i_{l+1}} L_{i_{l+1}}(P^{l|l+1}, Q^{l|l+1})$ required to encode $P^{l|l+1}$.

Thus the problem of encoding the source $P^0$ can be split into three steps:
transform the source from $P^0$ to $P^L$, encode the transformed source $P^L$, and
encode all of the transformations $P^{l|l+1}$ (for $l = 0, 1, \cdots, L-1$) to allow the
original source to be reconstructed from the transformed source. The total
number of bits L (P, Q) required to encode $P^L$ and $P^{l|l+1}$ (for $l = 0, 1, \cdots, L-1$)
is then an upper bound on the total number of bits $L(P^0, Q^0)$ required to encode
$P^0$. In this picture, a Markov chain is used to connect the original source $P^0$ to
the transformed source $P^L$, so the Markov chain relates one conventional density
modelling problem (i.e. optimising $Q^0$) to another (i.e. optimising $Q^L$).
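
The upper bound can be illustrated for a 2-layer chain. In the sketch below the conventional model of the input is taken to be the layer-0 marginal of the top-down model; this identification, together with the layer sizes, the random probabilities and the variable names, is an assumption made for the purpose of the check.

```python
import math
import random


def normalise(v):
    s = sum(v)
    return [x / s for x in v]


rng = random.Random(0)
m0, m1 = 5, 3  # hypothetical numbers of states in layers 0 and 1

p0 = normalise([rng.random() + 0.1 for _ in range(m0)])
# recog[i0][i1]: source probability of layer-1 state i1 given layer-0 state i0
recog = [normalise([rng.random() + 0.1 for _ in range(m1)]) for _ in range(m0)]
# gen[i1][i0]: model probability of layer-0 state i0 given layer-1 state i1
gen = [normalise([rng.random() + 0.1 for _ in range(m0)]) for _ in range(m1)]
q1 = normalise([rng.random() + 0.1 for _ in range(m1)])

# Joint cost L(P, Q): average of -log2 of the model's joint probability.
joint_bits = -sum(
    p0[i0] * recog[i0][i1] * math.log2(gen[i1][i0] * q1[i1])
    for i0 in range(m0) for i1 in range(m1)
)

# Conventional cost L(P^0, Q^0), with Q^0 the layer-0 marginal of the same model.
q0 = [sum(gen[i1][i0] * q1[i1] for i1 in range(m1)) for i0 in range(m0)]
input_bits = -sum(p0[i0] * math.log2(q0[i0]) for i0 in range(m0))

print(input_bits, joint_bits)  # input_bits never exceeds joint_bits
```

The inequality holds for any choice of the probabilities, because the marginal probability of a layer-0 state is at least as large as the joint probability of that state with any single layer-1 state.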

The above description of the relationship between conventional density mod- 
els and Markov density models was presented from the point of view of a con- 
ventional density modeller, who asserts that the goal is to build an optimum 
(i.e. minimum number of bits per symbol) density model $Q^0$ of the source $P^0$.
From this point of view, the Markov chain is merely a means of transforming
the problem from modelling the source $P^0$ to modelling the transformed source
$P^L$. That this transformation process is imperfect is reflected in the fact that
more bits per symbol are required to encode the transformed source $P^L$ (plus
the state of the Markov chain that generates it) than the original source $P^0$. A
conventional density modeller might reasonably ask what is the point of using
Markov density models, if they give only an upper bound on the number of bits
per symbol for encoding the original source $P^0$?

However, it is not at all clear that the conventional density modeller is using 
the correct objective function in the first place. Why should the number of 
bits per symbol for encoding the original source $P^0$ be especially important? It
is as if the world has been separated into an external world (i.e. $P^0$) and an
internal world (i.e. the $P^{l+1|l}$ for $l = 0, 1, \cdots, L-1$), and a special status is
accorded to the external world, which deems that it is important to model its
density $P^0$ accurately, at the expense of modelling the $P^{l+1|l}$ accurately. In the
Markov density modelling approach, this artificial boundary between external
and internal worlds is removed, because the Markov chain models the joint
density $(P^0, P^{1|0}, \cdots, P^{L|L-1})$, where $P^0$ and the $P^{l+1|l}$ are all accorded equal
status. This even-handed approach is much more natural than one in which
a particular part of the source (i.e. the external source) is accorded a special
status.

In the language of multilayer neural networks, the vector
$(P^0, P^{1|0}, \cdots, P^{L|L-1})$ is the source which comprises the bottom-up
transformations (or recognition models) which generate the states of the internal
layers of the network, and the vector $(Q^{0|1}, Q^{1|2}, \cdots, Q^L)$ is the model of
the source which comprises the top-down transformations (or generative
models). Thus the network is self-referential, because it forms a model of a 
source that includes its own internal states. This self-referential behaviour is 
present in both the conventional density modelling and in the Markov density 
modelling approaches, but whereas in the former case it is not optimised in 
an even-handed fashion, in the latter case it is optimised in an even-handed 
fashion. 

3 Application To Unsupervised Neural Networks 

In section 3.1 the theory of Markov source coding (that was presented in section
2.3) is applied to a multilayer neural network. In section 3.2 this approach is
applied to a 2-layer neural network to obtain a soft vector quantiser (VQ), which
is generalised to a multilayer neural network in section 3.3 to obtain a network of
coupled soft VQs. In section 3.4 it is shown how an approximation to Kohonen's
topographic mapping network can be derived from the theory of Markov source
coding. Finally, some additional results are briefly discussed in section 3.5.

3.1 Source Model of a Layered Network 

In this section the optimisation of the joint PDF of the states of all of the layers
of an (L + 1)-layer encoder of the type that was discussed in section 2.3 will be
considered. It turns out that this leads to new insights into the optimisation of
multilayer unsupervised neural networks.

The Markov chain source $P = (P^0, P^{1|0}, \cdots, P^{L-1|L-2}, P^{L|L-1})$ (or, equivalently,
$P = (P^{0|1}, P^{1|2}, \cdots, P^{L-1|L}, P^L)$) may be used to describe the true
behaviour (i.e. not merely a model) of a layered neural network as follows. $P^0$
is an external source, and $(P^{1|0}, \cdots, P^{L-1|L-2}, P^{L|L-1})$ is an internal source,
where external/internal describes whether the source is outside/inside the layered
network, respectively. $P^{l+1|l}$ is not part of the source itself (i.e. the external
source), rather it is a transition matrix that describes the way in which the state
of layer l of the neural network influences the state of layer l + 1. There is an
analogous interpretation of $P^L$ and the $P^{l|l+1}$.

The Markov chain model $Q = (Q^0, Q^{1|0}, \cdots, Q^{L-1|L-2}, Q^{L|L-1})$ (or, equivalently,
$Q = (Q^{0|1}, Q^{1|2}, \cdots, Q^{L-1|L}, Q^L)$) may then be used as a model (i.e.
not actually the true behaviour) of a layered neural network. Q has an analogous
interpretation to P, except that it is a model of the source, rather than
the true behaviour of the source.

It turns out to be useful for the true Markov behaviour (i.e. P) and the model
Markov behaviour (i.e. Q) to run in opposite directions through the Markov
chain. Thus $P = (P^0, P^{1|0}, \cdots, P^{L-1|L-2}, P^{L|L-1})$ (flow of influence from layer
0 to layer L of the Markov chain) and $Q = (Q^{0|1}, Q^{1|2}, \cdots, Q^{L-1|L}, Q^L)$ (flow
of influence from layer L to layer 0 of the Markov chain). In the conventional
language of neural networks, P is a "recognition model" and Q is a "generative
model". Note that the use of the word "model" in the terminology "recognition
model" is strictly speaking not accurate in this context, because P is a source,
not a model. However, terminology depends on one's viewpoint. In Markov
chain density modelling P is a source when viewed from the point of view of
the model Q, whereas in conventional density modelling $P^0$ is a source when
viewed from the point of view of the model $Q^0$, in which case
$(P^{1|0}, \cdots, P^{L-1|L-2}, P^{L|L-1})$ is a recognition model and
$(Q^{0|1}, Q^{1|2}, \cdots, Q^{L-1|L}, Q^L)$ is a generative model.

3.2 2-Layer Soft Vector Quantiser (VQ) Network 

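The expression for L (P, Q) in equation 13 has a simple internal structure which
allows it to be systematically analysed. Thus apply equation 13 to a 2-layer
network where

$$\begin{aligned}
P &= (P^0, P^{1|0}) \\
Q &= (Q^{0|1}, Q^1)
\end{aligned} \qquad (15)$$

to obtain the objective function

$$L(P, Q) = -\sum_{i_0=1}^{M_0} P^0_{i_0} \sum_{i_1=1}^{M_1} P^{1|0}_{i_1,i_0} \log Q^{0|1}_{i_0,i_1} + L(P^1, Q^1) \qquad (16)$$

Now change notation in order to make contact with previous results on vector
quantisers (VQ)

$$\begin{array}{lll}
i_0 \to x, & \sum_{i_0=1}^{M_0} \to \int dx & \text{input vector} \\
i_1 \to y, & \sum_{i_1=1}^{M_1} \to \sum_{y=1}^{M} & \text{output code index} \\
P^0_{i_0} \to \Pr(x) & & \text{input PDF} \\
P^{1|0}_{i_1,i_0} \to \Pr(y|x) & & \text{recognition model} \\
Q^{0|1}_{i_0,i_1} \to \frac{V}{(\sqrt{2\pi}\,\sigma)^{\dim x}} \exp\left(-\frac{\|x - x'(y)\|^2}{2\sigma^2}\right) & & \text{Gaussian generative model} \\
Q^1_{i_1} \to Q(y) & & \text{output prior}
\end{array} \qquad (17)$$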

where x is a continuous-valued input vector (e.g. the activity pattern in layer
0), $\sigma^2$ is the (isotropic) variance of the Gaussian generative model, V is an
infinitesimal volume element in input space which may be used to convert the
Gaussian probability density into a probability, and y is a discrete-valued output
index (e.g. the location of the next neuron to fire in layer 1). Note that
the parameter V must be introduced in order to regularise the number of bits
required to specify each source state. In effect, V specifies a resolution scale,
such that details on smaller scales are ignored.

The notation defined in equation 17 allows L (P, Q) to be written as

$$L(P, Q) = \frac{D_{VQ}}{4\sigma^2} + L(P^1, Q^1) - \log\left(\frac{V}{(\sqrt{2\pi}\,\sigma)^{\dim x}}\right) \qquad (18)$$

where $D_{VQ}$ is defined as

$$D_{VQ} \equiv 2 \int dx \Pr(x) \sum_{y=1}^{M} \Pr(y|x)\, \|x - x'(y)\|^2 \qquad (19)$$

The first term in equation 18 is proportional to the objective function $D_{VQ}$ for
a soft vector quantiser (VQ), where Pr (y|x) is a soft encoder, x' (y) is the
corresponding reconstruction vector attached to code index y, and $\|x - x'(y)\|$
is the $L_2$ norm of the reconstruction error. A standard VQ 0] (i.e. winner-take-all
encoder) has $\Pr(y|x) = \delta_{y,y(x)}$, which emerges as the optimal form when
this VQ objective function is minimised w.r.t. Pr (y|x) (see ^H] for a detailed
discussion of these issues). The second term in equation 18 (i.e. $L(P^1, Q^1)$) is
the cost of coding the output layer, and the third term is constant.

The effect of the $L(P^1, Q^1)$ term in L (P, Q) (see equation 18) is to encourage
$P^1_{i_1} \to \delta_{i_1,i_1^0}$ (only one state $i_1^0$ in layer 1 is used) and $Q^1 \to P^1$ (perfect model
in layer 1). The behaviour $P^1_{i_1} \to \delta_{i_1,i_1^0}$ is in conflict with the requirements of the
first term (i.e. the soft VQ) in L (P, Q), which requires that more than one state
in layer 1 is used, in order to minimise the reconstruction distortion. There is
a tradeoff between increasing the number of active states in layer 1 in order to
enable the Gaussian generative model ($Q^0$ is a Gaussian mixture distribution)
to make a good approximation to the external source $P^0$, and decreasing the
number of active states in layer 1 in order to make the average total number of
bits $L(P^1, Q^1)$ required to specify an output state as small as possible.
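
A sample-based version of the soft VQ objective in equation 19 makes the role of the encoder concrete. The sketch below replaces the integral over Pr(x) with an average over a handful of made-up sample points, and compares a soft encoder with the winner-take-all encoder for the same reconstruction vectors. The softmax form of the soft encoder, the `temperature` parameter, the sample data and the function names are all assumptions introduced for illustration.

```python
import math


def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def soft_posterior(x, codes, temperature):
    """Hypothetical soft encoder Pr(y|x): softmax of negative squared distance."""
    weights = [math.exp(-sq_dist(x, c) / temperature) for c in codes]
    z = sum(weights)
    return [w / z for w in weights]


def hard_posterior(x, codes):
    """Winner-take-all encoder: all probability on the nearest code vector."""
    d = [sq_dist(x, c) for c in codes]
    winner = d.index(min(d))
    return [1.0 if y == winner else 0.0 for y in range(len(codes))]


def d_vq(samples, codes, posterior):
    """Sample estimate of equation 19: 2 * average of sum_y Pr(y|x) ||x - x'(y)||^2."""
    total = 0.0
    for x in samples:
        pr = posterior(x)
        total += sum(p * sq_dist(x, c) for p, c in zip(pr, codes))
    return 2.0 * total / len(samples)


samples = [(0.0, 0.1), (0.2, -0.1), (1.9, 2.1), (2.1, 2.0), (1.0, 1.0)]
codes = [(0.0, 0.0), (2.0, 2.0)]  # reconstruction vectors x'(y)

soft = d_vq(samples, codes, lambda x: soft_posterior(x, codes, temperature=2.0))
hard = d_vq(samples, codes, lambda x: hard_posterior(x, codes))
print(soft, hard)  # the winner-take-all encoder gives the smaller distortion
```

For fixed reconstruction vectors the distortion is linear in Pr(y|x), so it is minimised by placing all of the probability on the nearest code vector, which is the winner-take-all form mentioned above.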



3.3 Coupled Soft VQ Networks 

The results of section 3.2 will now be generalised to an (L + 1)-layer network.
The objective function for coding a Markov source (equation 13) can be written,
using a notation which is analogous to that given in equation 17, as

$$L(P, Q) = \sum_{l=0}^{L-1} \frac{D^l_{VQ}}{4\sigma_l^2} + L(P^L, Q^L) - \sum_{l=0}^{L-1} \log\left(\frac{V_l}{(\sqrt{2\pi}\,\sigma_l)^{\dim x_l}}\right) \qquad (20)$$

where $D^l_{VQ}$ is defined as ($D^0_{VQ} = D_{VQ}$ as defined in equation 19)

$$D^l_{VQ} \equiv 2 \int dx_l \Pr(x_l) \sum_{y_{l+1}=1}^{M_{l+1}} \Pr(y_{l+1}|x_l)\, \|x_l - x'_l(y_{l+1})\|^2 \qquad (21)$$

where $x_l$ and $y_l$ are both used to denote the state of layer l. The notation $x_l$ is
used to denote the input to the encoder that connects layers l and l + 1, whereas
the notation $y_l$ denotes the output of the encoder that connects layers l − 1 and
l. This redundancy of notation is not actually necessary, but is used here to
preserve the distinction between input vectors and output codes.

The first term in equation 20 is a weighted sum (where each term is weighted
by $(\sigma_l)^{-2}$) of objective functions for a set of soft VQs connecting each of the L
neighbouring pairs of layers in the network. This type of network structure will
be called a VQ-ladder. The second term in equation 20 (i.e. $L(P^L, Q^L)$) is the
cost of coding the output layer, and the third term is constant.

If the cost $L(P^L, Q^L)$ of coding the output layer is ignored, then the multilayer
Markov source coding objective function L (P, Q) is minimised by minimising
the objective function $\sum_{l=0}^{L-1} \frac{D^l_{VQ}}{4\sigma_l^2}$ for a VQ-ladder (see QH] for a discussion
of this point in the context of folded Markov chains (FMC)). As the number
L of network layers is increased, the $L(P^L, Q^L)$ term has less
and less effect on the overall optimisation, because it is swamped by the
VQ-ladder term.



3.4 Topographic Mapping Network 

The results obtained in section 3.2 for a soft VQ may be generalised to obtain
a topographic mapping network whose properties closely resemble those of a
Kohonen network [3]. This derivation is based on the approach to topographic
mappings that was presented in 0. Thus apply equation 13 to a 3-layer network

$$\begin{aligned}
P &= (P^0, P^{1|0}, P^{2|1}) \\
Q &= (Q^{0|1}, Q^{1|2}, Q^2)
\end{aligned} \qquad (22)$$

where only layers 0 and 2 are included in the objective function, to obtain

$$L(P, Q) = -\sum_{i_0=1}^{M_0} P^0_{i_0} \sum_{i_2=1}^{M_2} P^{2|0}_{i_2,i_0} \log Q^{0|2}_{i_0,i_2} + L(P^2, Q^2) \qquad (23)$$

which should be compared with equation 16. An analogous change of notation
to that defined in equation 17 can be made

$$\begin{array}{lll}
i_0 \to x, & \sum_{i_0=1}^{M_0} \to \int dx & \text{input vector} \\
i_1 \to y & & \text{hidden code index} \\
i_2 \to z & & \text{output code index} \\
P^0_{i_0} \to \Pr(x) & & \text{input PDF} \\
P^{1|0}_{i_1,i_0} \to \Pr(y|x) & & \text{recognition model (first stage)} \\
P^{2|1}_{i_2,i_1} \to \Pr(z|y) & & \text{recognition model (second stage)} \\
Q^{0|2}_{i_0,i_2} \to \frac{V}{(\sqrt{2\pi}\,\sigma)^{\dim x}} \exp\left(-\frac{\|x - x'(z)\|^2}{2\sigma^2}\right) & & \text{Gaussian generative model} \\
Q^2_{i_2} \to Q(z) & & \text{output prior}
\end{array} \qquad (24)$$

to obtain

$$L(P, Q) = \frac{D_{VQ}}{4\sigma^2} + L(P^2, Q^2) - \log\left(\frac{V}{(\sqrt{2\pi}\,\sigma)^{\dim x}}\right) \qquad (25)$$

where $D_{VQ}$ is defined as

$$D_{VQ} \equiv 2 \int dx \Pr(x) \sum_{z=1}^{M_2} \Pr(z|x)\, \|x - x'(z)\|^2 \qquad (26)$$

which should be compared with the objective function in equation 19.

This expression for $D_{VQ}$ explicitly involves the states of layers 0 and 2 of
a 3-layer network, and it will now be manipulated into a form that explicitly
involves the states of layers 0 and 1. In order to simplify this calculation, $D_{VQ}$
will be replaced by the equivalent objective function ^H]

$$D_{VQ} = \int dx \Pr(x) \sum_{z=1}^{M_2} \Pr(z|x) \int dx' \Pr(x'|z)\, \|x - x'\|^2 \qquad (27)$$

Now introduce dummy integrations over the state of layer 1 to obtain

$$D_{VQ} = \int dx \Pr(x) \sum_{y=1}^{M_1} \Pr(y|x) \sum_{z=1}^{M_2} \Pr(z|y) \sum_{y'=1}^{M_1} \Pr(y'|z) \int dx' \Pr(x'|y')\, \|x - x'\|^2 \qquad (28)$$

and rearrange to obtain

$$D_{VQ} = \int dx \Pr(x) \sum_{y'=1}^{M_1} \Pr(y'|x) \int dx' \Pr(x'|y')\, \|x - x'\|^2 \qquad (29)$$

where

$$\begin{aligned}
\Pr(y'|y) &= \sum_{z=1}^{M_2} \Pr(y'|z) \Pr(z|y) \\
\Pr(y'|x) &= \sum_{y=1}^{M_1} \Pr(y'|y) \Pr(y|x)
\end{aligned} \qquad (30)$$

which may be replaced by the equivalent objective function

$$D_{VQ} = 2 \int dx \Pr(x) \sum_{y'=1}^{M_1} \Pr(y'|x)\, \|x - x'(y')\|^2 \qquad (31)$$

which should be compared with the objective function in equation 19.
The overall effect of manipulating equation 26 into the form given in equation
31 is to convert the objective function from one that explicitly involves the states
of layers 0 and 2, to one that explicitly involves the states of layers 0 and 1. This
change is reflected in the replacement of Pr (z|x) by Pr (y'|x). This new form for
the objective function (see equation 31) is exactly the same as for a standard
VQ (see equation 19), except that the posterior probability Pr(y|x) is now
processed through a transition matrix Pr(y'|y) to produce Pr(y'|x). Because
$\Pr(y'|y) = \sum_{z=1}^{M_2} \Pr(y'|z) \Pr(z|y)$, it takes account of the effect of the state
z of layer 2 on the training of layer 1, which is a type of self-supervision
in which higher layers of a network coordinate the training of lower layers.
However, viewed from the point of view of layer 1, the effect of the transition
matrix Pr(y'|y) is to do damage to the posterior probability by redistributing
probability amongst the states of layer 1. This process is thus called probability
leakage, and Pr(y'|y) is called a probability leakage matrix.

The objective function in equation 31 gives rise to a neural network that
closely resembles a Kohonen topographic mapping neural network [3], where
Pr(y'|y) may be identified as the topographic neighbourhood function, as was
shown in J2J. Note that in order for the topographic neighbourhood to be
localised (i.e. Pr(y'|y) > 0 only for y' in some local neighbourhood of y), the
transition matrix Pr(z|y) that generates the state of layer 2 from the state of
layer 1 must generate each z state from y states that are all close to each other.
This connection with Kohonen topographic mapping neural networks is only
approximate, because the training algorithm proposed by Kohonen does not
correspond to the minimisation of any objective function. A generalised version
of the Kohonen network which allows a factorial code to emerge may be derived
using the results in section 5.
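
The construction of the probability leakage matrix can be sketched directly from its definition, Pr(y'|y) = sum over z of Pr(y'|z) Pr(z|y), with Pr(y'|z) obtained from Pr(z|y) by Bayes' theorem. In the example below the six layer-1 states lie on a line, the three layer-2 states are each generated mainly by nearby layer-1 states, and the prior over layer-1 states is uniform. These choices, the Gaussian-shaped Pr(z|y) and the variable names are assumptions made for illustration.

```python
import math


def normalise(v):
    s = sum(v)
    return [x / s for x in v]


n_y = 6
centres = [0.5, 2.5, 4.5]        # hypothetical positions of the layer-2 states
prior_y = [1.0 / n_y] * n_y      # assumed uniform Pr(y)

# forward[y][z] = Pr(z|y): each z is generated mainly by nearby y states.
forward = [normalise([math.exp(-(y - c) ** 2) for c in centres])
           for y in range(n_y)]

# Bayes' theorem: backward[z][y] = Pr(y|z).
pr_z = [sum(forward[y][z] * prior_y[y] for y in range(n_y))
        for z in range(len(centres))]
backward = [[forward[y][z] * prior_y[y] / pr_z[z] for y in range(n_y)]
            for z in range(len(centres))]

# leakage[y][y2] = Pr(y2|y) = sum_z Pr(y2|z) Pr(z|y).
leakage = [[sum(backward[z][y2] * forward[y][z] for z in range(len(centres)))
            for y2 in range(n_y)]
           for y in range(n_y)]

for row in leakage:
    print(" ".join(f"{v:.3f}" for v in row))
```

Each printed row is a probability distribution over y', and the probability leaks mostly to states close to y, which is the localisation property required of a topographic neighbourhood function.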

3.5 Additional Results 

The objective function X)i=o 4(<^? a ^ or a VQ-ladder couples the optimisation of 
the individual 2-layer VQs together. Because the output of the I VQ is the 
input to the (I + 1) VQ (for I = 0, 1, 2 • • • , L — 1 ), the optimisation of the 
k th VQ has side effects on the optimisation of the I th VQs (for I = k + 1, k + 
2, • • • , L— 1). This leads to the effect called self-supervision, in which top-down 
connections from higher to lower network layers are automatically generated, 
to allow the lower layers to process their input more effectively in the light of 
what the higher layers discover in the data jH]- This is the multilayer extension 
of the self-supervision effect that led to topographic mappings in section 13.41 

The general expression for L (P, Q) in equation El is the sum of two terms: 
the objective function $\sum_{l=0}^{L-1} \sum_{i_l=1}^{M_l} P^l_{i_l} K_{i_l}(P^{l+1|l}, Q^{l|l+1})$ for a ladder (because
$Q$ is not necessarily Gaussian, the ladder is not necessarily a VQ-ladder), plus
the cost $L(P^L, Q^L)$ of encoding layer $L$. The $L(P^L, Q^L)$ term has precisely
the form that is commonly used in density modelling, so any convenient density
model could be used to parameterise $Q^L$ in layer $L$. A typical implementation
of the type of network that minimises $L(P, Q)$ thus splits into two pieces
corresponding to the two different types of term in the objective function. In
the special case where $L = 0$ (i.e. no ladder is used) this approach reduces to
standard input density modelling.



4 Hierarchical Encoding using an Adaptive
Cluster Expansion (ACE)

In this section the adaptive cluster expansion (ACE) network is discussed.
ACE is a tree-structured network, whose purpose is to decompose high-dimensional
input vectors into a number of lower dimensional pieces. In section
4.1 the case of a deterministic source and a perfect model is considered, and in
section 4.2 the case of a Gaussian model is discussed.



4.1 ACE: Tree-Structured Density Network 

Consider the objective function L (P, Q) for encoding an L + 1 layer Markov 
source (see equation^l), and assume that the $Q^{l|l+1}$ part of the model is perfect
so that $Q^{l|l+1} = P^{l|l+1}$ (for $l = 0, 1, \cdots, L-1$), and that the $P^{l+1|l}$ part of
the source is deterministic so that $P^{l+1|l}_{i_{l+1},i_l} = \delta_{i_{l+1},i_{l+1}(i_l)}$ (for $l = 0, 1, \cdots, L-1$),
in which case $L(P, Q)$ simplifies as follows (see appendix A.1)

$$L(P,Q) = H(P^0) - H(P^L) + L(P^L,Q^L) \tag{32}$$

where $H(P^0) - H(P^L)$ is the number of bits per symbol required to convert a
$P^L$-message into a $P^0$-message, assuming that the $P^{l+1|l}$ part of the source is
deterministic, and that the model is perfect. This result is not very interesting
in itself.

However, if the $P^{l+1|l}$ part of the source is not only deterministic, but is also
tree-structured, and the model is similarly tree-structured, then the notation
must be modified thus

$$\begin{aligned}
i_l &\rightarrow \mathbf{i}_l = (\mathbf{i}_l^1, \mathbf{i}_l^2, \cdots) \\
i_{l+1} &\rightarrow \mathbf{i}_{l+1} = (i_{l+1}^1, i_{l+1}^2, \cdots) \\
P^{l+1|l}_{i_{l+1},i_l} &\rightarrow P^{l+1|l}_{\mathbf{i}_{l+1},\mathbf{i}_l} = P^{l+1|l}_{i_{l+1}^1,\mathbf{i}_l^1} P^{l+1|l}_{i_{l+1}^2,\mathbf{i}_l^2} \cdots = \delta_{i_{l+1}^1,i_{l+1}^1(\mathbf{i}_l^1)}\, \delta_{i_{l+1}^2,i_{l+1}^2(\mathbf{i}_l^2)} \cdots \\
Q^{l|l+1}_{i_l,i_{l+1}} &\rightarrow Q^{l|l+1}_{\mathbf{i}_l,\mathbf{i}_{l+1}} = P^{l|l+1}_{\mathbf{i}_l^1,i_{l+1}^1} P^{l|l+1}_{\mathbf{i}_l^2,i_{l+1}^2} \cdots
\end{aligned} \tag{33}$$

where the state $i_l$ of layer $l$ of the tree-structured Markov source is more naturally
written as a vector state $\mathbf{i}_l$ that specifies the joint state of each branch
of layer $l$ of the tree (the $i_l$ style of notation is more suitable for a
non-tree-structured Markov source). Furthermore, the components of the vector $\mathbf{i}_l$ are
partitioned as $(\mathbf{i}_l^1, \mathbf{i}_l^2, \cdots)$, where each $\mathbf{i}_l^c$ is the joint state of a subset $c$ of nodes
in layer $l$, where the nodes in each subset are all siblings as seen from the
point of view of layer $l+1$. Such a set of siblings is called a cluster. The components
of the vector $\mathbf{i}_{l+1}$ are partitioned as $(i_{l+1}^1, i_{l+1}^2, \cdots)$, where $i_{l+1}^c$ is the
state of the parent (in layer $l+1$) of the siblings in cluster $c$ in layer $l$.

This notation may be used to rearrange $L(P, Q)$ as follows (see appendix A.2)

$$L(P,Q) = \sum_{l=0}^{L-1} \sum_{\text{cluster } c} H(\mathbf{P}_c^l) - \sum_{l=1}^{L} \sum_{\text{component } c} H(P_c^l) + L(P^L,Q^L) \tag{34}$$

This expression for $L(P, Q)$ can be rewritten in terms of the mutual information
$I(\mathbf{P}_c^l)$ between the components of cluster $\mathbf{i}_l^c$ as (see appendix A.2)

$$L(P,Q) = -\sum_{l=1}^{L} \sum_{\text{cluster } c} I(\mathbf{P}_c^l) + \sum_{\text{cluster } c} H(\mathbf{P}_c^0) - \sum_{\text{cluster } c} H(\mathbf{P}_c^L) + L(P^L,Q^L) \tag{35}$$



Now assume that the model is perfect in the output layer, so that $Q^L$ is given
by $Q^L_{\mathbf{i}_L} = P^L_{\mathbf{i}_L^1} P^L_{\mathbf{i}_L^2} \cdots$. This allows $L(P^L, Q^L)$ to be simplified as
$L(P^L, Q^L) = \sum_{\text{cluster } c} H(\mathbf{P}_c^L)$, so that $L(P, Q)$ may finally be expressed as

$$L(P,Q) = -\sum_{l=1}^{L} \sum_{\text{cluster } c} I(\mathbf{P}_c^l) + \sum_{\text{cluster } c} H(\mathbf{P}_c^0) \tag{36}$$

The $-\sum_{l=1}^{L} \sum_{\text{cluster } c} I(\mathbf{P}_c^l)$ term is (minus) the sum of the mutual informations
within all of the clusters in the $L+1$ layer network, and the $\sum_{\text{cluster } c} H(\mathbf{P}_c^0)$
term is constant for a given external source $P^0$. This means that minimising
$L(P, Q)$ is equivalent to maximising $\sum_{l=1}^{L} \sum_{\text{cluster } c} I(\mathbf{P}_c^l)$. This is the maximum
mutual information result for ACE networks [Jj, which includes the mutual
information maximisation principle in fl] as a special case.
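
As a small illustration of the quantity being maximised, the following sketch computes the mutual information within one cluster as the sum of the component entropies minus the cluster entropy (the definition used in appendix A.2). It assumes a cluster with exactly two components whose joint probability is given as a table; the function names and the two example tables are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a probability vector (0 log 0 taken as 0)."""
    return -sum(q * math.log2(q) for q in p if q > 0)


def cluster_mutual_information(joint):
    """Mutual information between the two components of a cluster.

    joint[a][b] is the joint probability that the first component is in
    state a and the second component is in state b.
    """
    first = [sum(row) for row in joint]         # marginal of component 1
    second = [sum(col) for col in zip(*joint)]  # marginal of component 2
    cluster = [q for row in joint for q in row]  # joint state of the cluster
    return entropy(first) + entropy(second) - entropy(cluster)


correlated = [[0.5, 0.0], [0.0, 0.5]]        # components always agree
independent = [[0.25, 0.25], [0.25, 0.25]]   # components unrelated

print(cluster_mutual_information(correlated))
print(cluster_mutual_information(independent))
```

A cluster whose components always agree carries one bit of mutual information, whereas independent components carry none, so only the former contributes to the sum that the ACE network maximises.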

Note that if the source is deterministic and the model is perfect (as they
are here), then $L(P^0, Q^0) = L(P, Q)$, which implies that input density optimisation
is equivalent to joint density optimisation. This equivalence was used
in [J], where the sum-of-mutual-informations objective function was derived by
minimising $L(P^0, Q^0)$.

4.2 ACE: Hierarchical Vector Quantiser 

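If the above ACE network is modified slightly, so that the model $Q$ has exactly
the same structure as before, but is Gaussian rather than perfect, then $Q^{l|l+1}_{\mathbf{i}_l,\mathbf{i}_{l+1}}$
becomes

$$Q^{l|l+1}_{\mathbf{i}_l,\mathbf{i}_{l+1}} = Q^{l|l+1}_{\mathbf{i}_l^1,i_{l+1}^1} Q^{l|l+1}_{\mathbf{i}_l^2,i_{l+1}^2} \cdots \tag{37}$$

where the individual $Q^{l|l+1}_{\mathbf{i}_l^c,i_{l+1}^c}$ are Gaussian. The expression for $L(P, Q)$ may
then be written down by analogy with equation 20

$$L(P,Q) = \sum_{l=0}^{L-1} \sum_{\text{cluster } c} \frac{D_{VQ}^{l,c}}{4(\sigma_{l,c})^2} + L(P^L,Q^L) - \sum_{l=0}^{L-1} \sum_{\text{cluster } c} \log\left(\frac{V_{l,c}}{(\sqrt{2\pi}\,\sigma_{l,c})^{\dim \mathbf{x}_{l,c}}}\right) \tag{38}$$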

Thus the ACE network, with a Gaussian model Q, is a hierarchical VQ-ladder 
(or VQ-tree), in which each layer encodes the clusters in the previous layer 0. 



5 Factorial Encoding using a Partitioned Mixture
Distribution (PMD)

In this section a useful parameterisation of the conditional probability $P^{l+1|l}$
for building the Markov source is introduced in order to encourage $P^{l+1|l}$ to
form factorial codes of the state of layer $l$. It turns out that there is a simple
way of allowing such codes to develop, which is called the partitioned mixture
distribution (PMD) [§]. A PMD achieves this by encoding its input simultaneously
with a number of different recognition models, each of which potentially
can encode a different part of the input.

In section 5.1 two ways in which multiple recognition models can be used
for factorial encoding are discussed, and a hybrid approach (which is a PMD)
is discussed in section 5.2.



5.1 Multiple Recognition Models 

In the expression for $L(P, Q)$ (see equation 13) the generative models $Q^{l|l+1}$
may be parameterised as Gaussian probability densities, whereas the recognition
models $P^{l+1|l}$ may be parameterised in a more general way as

$$P^{l+1|l}_{i_{l+1},i_l} = \frac{P^{l|l+1}_{i_l,i_{l+1}} P^{l+1}_{i_{l+1}}}{\sum_{i'_{l+1}=1}^{M_{l+1}} P^{l|l+1}_{i_l,i'_{l+1}} P^{l+1}_{i'_{l+1}}} \tag{39}$$

which guarantees the normalisation condition $\sum_{i_{l+1}=1}^{M_{l+1}} P^{l+1|l}_{i_{l+1},i_l} = 1$. A limitation
of this type of recognition model is that it allows only a single explanation $i_{l+1}$
of the data $i_l$ (in the case of a hard $P^{l+1|l}$), or a probability distribution over
single explanations (in the case of a soft $P^{l+1|l}$), so it cannot lead to a factorial
encoding of the data.

The simplest way of allowing a factorial encoding to develop is to make
simultaneous use of more than one recognition model. Each recognition model
uses its own $P^{l+1}$ vector and $P^{l|l+1}$ matrix to compute a posterior probability
of the type shown in equation 39, so that if each recognition model is sensitised
to a different part of the input, then a factorial code can develop. This approach
can be formalised by making the replacement $i_{l+1} \rightarrow \mathbf{i}_{l+1}$ in equation 39 (i.e.
replace the scalar code index by a vector code index, where the number of vector
components is equal to the number of recognition models). If the components
of $\mathbf{i}_{l+1}$ are determined independently of each other, then their joint posterior
probability $P^{l+1|l}_{\mathbf{i}_{l+1},i_l}$ is a product of independent posterior probabilities, where
each posterior probability corresponds to one of the recognition models, and
thus has its own $P^{l+1}$ vector and $P^{l|l+1}$ matrix.

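If this type of posterior probability, which is a product of $n$ independent factors
if there are $n$ independent recognition models, is then inserted into equation
19 it yields (see appendix B)

$$D_{VQ} \le 2 \int d\mathbf{x}\, \Pr(\mathbf{x}) \sum_{y_1=1}^{M_1} \sum_{y_2=1}^{M_2} \cdots \sum_{y_n=1}^{M_n} \Pr(y_1|\mathbf{x},1) \Pr(y_2|\mathbf{x},2) \cdots \Pr(y_n|\mathbf{x},n) \left\| \mathbf{x} - \frac{1}{n} \sum_{k=1}^{n} \mathbf{x}'_k(y_k) \right\|^2 \tag{40}$$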



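If a single recognition model is independently used $n$ times, rather than $n$ independent
recognition models each independently being used once, then the above
result becomes

$$D_{VQ} \le \frac{2}{n} \int d\mathbf{x}\, \Pr(\mathbf{x}) \sum_{y=1}^{M} \Pr(y|\mathbf{x}) \left\| \mathbf{x} - \mathbf{x}'(y) \right\|^2 + \frac{2(n-1)}{n} \int d\mathbf{x}\, \Pr(\mathbf{x}) \left\| \mathbf{x} - \sum_{y=1}^{M} \Pr(y|\mathbf{x})\, \mathbf{x}'(y) \right\|^2 \tag{41}$$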



In the case $n = 1$ this correctly reduces to equation 19 (the inequality reduces
to an equality in this case). When $n > 1$ the second term offers the possibility
of factorial encoding, because it contains a weighted linear combination
$\sum_{y=1}^{M} \Pr(y|\mathbf{x})\, \mathbf{x}'(y)$ of vectors.
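
The step from equation 40 to equation 41 rests on an identity for the integrand at a fixed input, which can be checked numerically. The sketch below is an illustration only: it uses a scalar input, drops the common factor of 2 and the integral over the input, and enumerates all $n$-tuples of code indices drawn independently from one posterior. The function names and the example numbers are hypothetical.

```python
import itertools


def averaged_reconstruction_error(x, post, codebook, n):
    """Exact expectation of (x - (1/n) sum_k x'(y_k))^2.

    The code indices y_1..y_n are drawn independently from the same
    posterior post[y] = Pr(y|x); codebook[y] plays the role of x'(y).
    """
    total = 0.0
    for ys in itertools.product(range(len(post)), repeat=n):
        prob = 1.0
        for y in ys:
            prob *= post[y]
        recon = sum(codebook[y] for y in ys) / n
        total += prob * (x - recon) ** 2
    return total


def two_term_form(x, post, codebook, n):
    """The integrand of equation 41 at a fixed x, without the factor of 2."""
    single = sum(p * (x - c) ** 2 for p, c in zip(post, codebook))
    mean_recon = sum(p * c for p, c in zip(post, codebook))
    return single / n + (n - 1) / n * (x - mean_recon) ** 2


x, post, codebook = 0.3, [0.2, 0.5, 0.3], [-1.0, 0.0, 2.0]
for n in (1, 2, 3):
    print(n, averaged_reconstruction_error(x, post, codebook, n),
          two_term_form(x, post, codebook, n))
```

For each $n$ the two printed values agree, and for $n = 1$ the second term vanishes, which matches the remark that the single-model case is recovered.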



5.2 Average Over Recognition Models 

Now combine the above two approaches to factorial encoding, so that a single
recognition model is used (as in equation 41), which is parameterised in such
a way that it can emulate multiple recognition models (as in equation 40).
The simplest possibility is to firstly make the replacement $P^{l+1}_{i_{l+1}} \rightarrow A^{l+1}_{k,i_{l+1}} P^{l+1}_{i_{l+1}}$
(where $A^{l+1}_{k,i_{l+1}} \ge 0$) in equation 39, where $k$ is a recognition model index which
ranges over $k = 1, 2, \cdots, K$ (note that $K$ is not constrained to be the same as
$n$), and then secondly to average over $k$, to produce

$$P^{l+1|l}_{i_{l+1},i_l} = \frac{1}{K} \sum_{k=1}^{K} \frac{P^{l|l+1}_{i_l,i_{l+1}} A^{l+1}_{k,i_{l+1}} P^{l+1}_{i_{l+1}}}{\sum_{i'_{l+1}=1}^{M_{l+1}} P^{l|l+1}_{i_l,i'_{l+1}} A^{l+1}_{k,i'_{l+1}} P^{l+1}_{i'_{l+1}}} \tag{42}$$

In effect, $K$ recognition models are embedded between layer $l$ and layer $l+1$,
and the $A^{l+1}$ matrix specifies which indices $i_{l+1}$ in layer $l+1$ are associated
with recognition model $k$.

The result in equation 42 is not the same as the result that would have been
obtained using a Bayesian analysis, in which the posterior probabilities generated
by different models are combined to yield a single posterior probability. In
appendix B there is a discussion of the relationship between the above proposed
PMD recognition model and a full Bayesian average over alternative recognition
models.

A partitioned mixture distribution (PMD) is precisely this type of multiple
embedded recognition model. In the simplest type of PMD the $A^{l+1}$ matrix is
chosen to contain only 0's and 1's, which are arranged so that the $K$ recognition
models partition layer $l+1$ into $K$ overlapping patches j^j. A wide range of
types of PMD can be constructed by choosing $A^{l+1}$ appropriately.
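
A minimal sketch of the averaged posterior in equation 42 is given below, for a single fixed state of layer $l$. It assumes the simplest case described above (a matrix of 0's and 1's defining overlapping patches); the function name `pmd_posterior`, the patch layout and the numbers are hypothetical choices made for illustration.

```python
def pmd_posterior(likelihood, prior, A):
    """Average of K embedded posteriors (a sketch of equation 42).

    likelihood[y] stands for P^{l|l+1} evaluated at the fixed layer-l state
    and layer-(l+1) index y, prior[y] stands for P^{l+1}, and A[k][y] >= 0
    says how strongly index y belongs to recognition model k.
    """
    K, M = len(A), len(prior)
    post = [0.0] * M
    for k in range(K):
        weights = [likelihood[y] * A[k][y] * prior[y] for y in range(M)]
        norm = sum(weights)  # normalisation restricted to model k's indices
        if norm == 0:
            raise ValueError("recognition model %d has zero total weight" % k)
        for y in range(M):
            post[y] += weights[y] / (K * norm)
    return post


likelihood = [0.1, 0.4, 0.4, 0.1]
prior = [0.25, 0.25, 0.25, 0.25]
# Three recognition models, each covering a patch of two adjacent indices.
A = [[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1]]
post = pmd_posterior(likelihood, prior, A)
print(post, sum(post))
```

With a single model that covers every index (one row of ones) the function reduces to the ordinary Bayes posterior of equation 39.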

In section 3.4 it was shown how a Kohonen topographic mapping emerged
when a 3-layer Markov source network was optimised. If the PMD posterior
probability (see equation 42) had been used in section 3.4 then a more general
form of topographic mapping (i.e. a factorial topographic mapping) would have
emerged (this is briefly discussed in P]).

6 Conclusions 

The objective function for optimising the density model of a Markov source may 
be applied to the problem of optimising the joint density of all the layers of a 
neural network. This is possible because the joint state of all of the network 
layers may be viewed as a Markov chain of states (each layer is connected only to 
adjacent layers). This representation makes contact with the results that were 
reported in ^H], and allows many results to be unified into a single approach 
(i.e. a single objective function). 

The most significant aspect of this unification is the fact that all layers 
of a neural network are treated on an equal footing, unlike in the conventional 
approach to density modelling where the input layer is accorded a special status. 
For instance, this leads to a modular approach to building neural networks, 
where all of the modules have the same structure.

A ACE

In this appendix some of the more technical details relevant to section 4 are
given.

A.1 Perfect Model, Deterministic Source

The derivation of the result in equation 32 for a perfect model (i.e. $Q = P$)
and a deterministic source (i.e. $P^{l+1|l}_{i_{l+1},i_l} = \delta_{i_{l+1},i_{l+1}(i_l)}$, using scalar notation $i_l$
rather than vector notation $\mathbf{i}_l$ for the state of layer $l$, because here the Markov
source is not assumed to be tree-structured) is as follows. The basic definition
of $L(P, Q)$ in equation El may be written as

$$L(P,Q) - L(P^L,Q^L) = -\sum_{l=0}^{L-1} \sum_{i_l=1}^{M_l} P^l_{i_l} \sum_{i_{l+1}=1}^{M_{l+1}} P^{l+1|l}_{i_{l+1},i_l} \log P^{l|l+1}_{i_l,i_{l+1}} \tag{43}$$

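This may be simplified by noting that $P^{l+1|l}_{i_{l+1},i_l} = \delta_{i_{l+1},i_{l+1}(i_l)}$ and that Bayes'
theorem gives $P^{l|l+1}_{i_l,i_{l+1}} = \frac{\delta_{i_{l+1},i_{l+1}(i_l)} P^l_{i_l}}{P^{l+1}_{i_{l+1}}}$, which yields

$$L(P,Q) - L(P^L,Q^L) = -\sum_{l=0}^{L-1} \sum_{i_l=1}^{M_l} \sum_{i_{l+1}=1}^{M_{l+1}} P^l_{i_l}\, \delta_{i_{l+1},i_{l+1}(i_l)} \log\left(\frac{\delta_{i_{l+1},i_{l+1}(i_l)} P^l_{i_l}}{P^{l+1}_{i_{l+1}}}\right) \tag{44}$$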

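Now use that $\sum_{i_{l+1}=1}^{M_{l+1}} \delta_{i_{l+1},i_{l+1}(i_l)} = 1$, $\sum_{i_{l+1}=1}^{M_{l+1}} \delta_{i_{l+1},i_{l+1}(i_l)} \log \delta_{i_{l+1},i_{l+1}(i_l)} = 0$,
and $\sum_{i_l=1}^{M_l} P^l_{i_l}\, \delta_{i_{l+1},i_{l+1}(i_l)} = P^{l+1}_{i_{l+1}}$ to reduce this to

$$L(P,Q) - L(P^L,Q^L) = -\sum_{l=0}^{L-1} \sum_{i_l=1}^{M_l} P^l_{i_l} \log P^l_{i_l} + \sum_{l=0}^{L-1} \sum_{i_{l+1}=1}^{M_{l+1}} P^{l+1}_{i_{l+1}} \log P^{l+1}_{i_{l+1}} \tag{45}$$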

The terms in these two series mostly cancel each other to yield

$$L(P,Q) - L(P^L,Q^L) = -\sum_{i_0=1}^{M_0} P^0_{i_0} \log P^0_{i_0} + \sum_{i_L=1}^{M_L} P^L_{i_L} \log P^L_{i_L} \tag{46}$$

and using the definition of entropy this may finally be written as

$$L(P,Q) - L(P^L,Q^L) = H(P^0) - H(P^L) \tag{47}$$
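
The result in equation 47 can be checked numerically on a small deterministic source. The sketch below evaluates the ladder cost layer by layer, using the Bayes inverse of a deterministic map as the perfect model, and compares it with the entropy difference. The function names, the two maps and the input probabilities are hypothetical, and logarithms are taken to base 2 so that the result is in bits.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    return -sum(q * math.log2(q) for q in p if q > 0)


def push_forward(p, f, m):
    """Probabilities of layer l+1 when state i of layer l maps to state f[i]."""
    out = [0.0] * m
    for i, q in enumerate(p):
        out[f[i]] += q
    return out


def ladder_cost(p0, maps, sizes):
    """Evaluate L(P,Q) - L(P^L,Q^L) for a deterministic source and perfect model.

    For a deterministic map f, Bayes' theorem gives the perfect generative
    model as P^l_i / P^{l+1}_{f(i)}, which is what appears inside the log.
    Returns the cost and the probabilities of the top layer.
    """
    cost, p = 0.0, list(p0)
    for f, m in zip(maps, sizes):
        p_next = push_forward(p, f, m)
        cost -= sum(q * math.log2(q / p_next[f[i]]) for i, q in enumerate(p) if q > 0)
        p = p_next
    return cost, p


p0 = [0.1, 0.2, 0.3, 0.4]
maps = [[0, 0, 1, 2], [0, 1, 1]]  # layer 0 -> layer 1 -> layer 2
sizes = [3, 2]
cost, p_top = ladder_cost(p0, maps, sizes)
print(cost, entropy(p0) - entropy(p_top))
```

The two printed numbers coincide, because the intermediate layer entropies cancel in pairs exactly as in the step from equation 45 to equation 46.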






A.2 Perfect Model, Deterministic Source: Tree-Structured Case

The derivation of the result in equation 021 for a perfect tree-structured model 
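and a deterministic tree-structured source may be obtained by altering the
notation in appendix A.1 to reflect the fact that both the Markov source and
model are now tree-structured. Thus use the notation defined in equation 33 to
write $L(P, Q)$ as

$$L(P,Q) - L(P^L,Q^L) = -\sum_{l=0}^{L-1} \sum_{\mathbf{i}_l} P^l_{\mathbf{i}_l} \sum_{\mathbf{i}_{l+1}} P^{l+1|l}_{\mathbf{i}_{l+1},\mathbf{i}_l} \sum_{\text{cluster } c} \log P^{l|l+1}_{\mathbf{i}_l^c,i_{l+1}^c} \tag{48}$$

Now use Bayes' theorem in the form $P^{l|l+1}_{\mathbf{i}_l^c,i_{l+1}^c} = \frac{P^{l+1|l}_{i_{l+1}^c,\mathbf{i}_l^c} P^l_{\mathbf{i}_l^c}}{P^{l+1}_{i_{l+1}^c}}$ to write this as

$$L(P,Q) - L(P^L,Q^L) = -\sum_{l=0}^{L-1} \sum_{\mathbf{i}_l} P^l_{\mathbf{i}_l} \sum_{\mathbf{i}_{l+1}} P^{l+1|l}_{\mathbf{i}_{l+1},\mathbf{i}_l} \sum_{\text{cluster } c} \log\left(\frac{P^{l+1|l}_{i_{l+1}^c,\mathbf{i}_l^c} P^l_{\mathbf{i}_l^c}}{P^{l+1}_{i_{l+1}^c}}\right) \tag{49}$$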

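This may be simplified by using that $\sum_{\mathbf{i}_{l+1}} P^{l+1|l}_{\mathbf{i}_{l+1},\mathbf{i}_l} = 1$ (for the $\log P^l_{\mathbf{i}_l^c}$ term),
$\sum_{\mathbf{i}_l} P^l_{\mathbf{i}_l} \sum_{\mathbf{i}_{l+1}} P^{l+1|l}_{\mathbf{i}_{l+1},\mathbf{i}_l} (\cdots) = \sum_{\mathbf{i}_{l+1}} P^{l+1}_{\mathbf{i}_{l+1}} (\cdots)$ (for the $\log P^{l+1}_{i_{l+1}^c}$ term), and
$\sum_{i_{l+1}^c} P^{l+1|l}_{i_{l+1}^c,\mathbf{i}_l^c} \log P^{l+1|l}_{i_{l+1}^c,\mathbf{i}_l^c} = 0$ (for the $\log P^{l+1|l}_{i_{l+1}^c,\mathbf{i}_l^c}$ term, because the source is deterministic), to yield

$$L(P,Q) - L(P^L,Q^L) = -\sum_{l=0}^{L-1} \sum_{\mathbf{i}_l} P^l_{\mathbf{i}_l} \sum_{\text{cluster } c} \log P^l_{\mathbf{i}_l^c} + \sum_{l=0}^{L-1} \sum_{\mathbf{i}_{l+1}} P^{l+1}_{\mathbf{i}_{l+1}} \sum_{\text{cluster } c} \log P^{l+1}_{i_{l+1}^c} \tag{50}$$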

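The first term may be simplified by interchanging the order of summation
$\sum_{\mathbf{i}_l} \sum_{c} (\cdots) = \sum_{c} \sum_{\mathbf{i}_l} (\cdots)$, and then marginalising the probabilities
using that $\sum_{\text{cluster } c} \sum_{\mathbf{i}_l} P^l_{\mathbf{i}_l} \log P^l_{\mathbf{i}_l^c} = \sum_{\text{cluster } c} \sum_{\mathbf{i}_l^c} P^l_{\mathbf{i}_l^c} \log P^l_{\mathbf{i}_l^c}$. The
second term may be simplified by interchanging the order of summation
$\sum_{\mathbf{i}_{l+1}} \sum_{\text{cluster } c} (\cdots) = \sum_{\text{cluster } c} \sum_{\mathbf{i}_{l+1}} (\cdots)$, then marginalising the probabilities
using that $\sum_{\text{cluster } c} \sum_{\mathbf{i}_{l+1}} P^{l+1}_{\mathbf{i}_{l+1}} \log P^{l+1}_{i_{l+1}^c} = \sum_{\text{cluster } c} \sum_{i_{l+1}^c} P^{l+1}_{i_{l+1}^c} \log P^{l+1}_{i_{l+1}^c}$,
and then using that component $c$ in layer $l+1$ is the parent of cluster $c$ in layer
$l$, to obtain

$$L(P,Q) - L(P^L,Q^L) = -\sum_{l=0}^{L-1} \sum_{\text{cluster } c} \sum_{\mathbf{i}_l^c} P^l_{\mathbf{i}_l^c} \log P^l_{\mathbf{i}_l^c} + \sum_{l=1}^{L} \sum_{\text{component } c} \sum_{i_l^c} P^l_{i_l^c} \log P^l_{i_l^c} \tag{51}$$

and using the definition of entropy this may finally be written as

$$L(P,Q) - L(P^L,Q^L) = \sum_{l=0}^{L-1} \sum_{\text{cluster } c} H(\mathbf{P}_c^l) - \sum_{l=1}^{L} \sum_{\text{component } c} H(P_c^l) \tag{52}$$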

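where $H(\mathbf{P}_c^l)$ is the entropy of cluster $c$ and $H(P_c^l)$ is the entropy of component
$c$ (both in layer $l$). The mutual information $I(\mathbf{P}_c^l)$ between the components $c'$
of cluster $c$ is defined as

$$I(\mathbf{P}_c^l) = \sum_{\substack{\text{component } c' \\ \text{in cluster } c}} H(P_{c'}^l) - H(\mathbf{P}_c^l) \tag{53}$$

and using that $\sum_{\text{cluster } c} \sum_{\text{component } c' \text{ in cluster } c} (\cdots) = \sum_{\text{component } c'} (\cdots)$ this yields

$$\sum_{\text{cluster } c} I(\mathbf{P}_c^l) = \sum_{\text{component } c} H(P_c^l) - \sum_{\text{cluster } c} H(\mathbf{P}_c^l) \tag{54}$$

which allows $L(P, Q) - L(P^L, Q^L)$ to be simplified to

$$L(P,Q) - L(P^L,Q^L) = -\sum_{l=1}^{L} \sum_{\text{cluster } c} I(\mathbf{P}_c^l) + \sum_{\text{cluster } c} H(\mathbf{P}_c^0) - \sum_{\text{cluster } c} H(\mathbf{P}_c^L) \tag{55}$$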

B PMD 

In this appendix some of the more technical details relevant to section 5 are
given.

B.1 PMD Recognition Model

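If the type of posterior probability introduced in section 5.1, which is a product
of $n$ independent factors if there are $n$ independent recognition models, is then
inserted into equation 19 it yields a $D_{VQ}$ of the form

$$D_{VQ} = 2 \int d\mathbf{x}\, \Pr(\mathbf{x}) \sum_{y_1=1}^{M_1} \sum_{y_2=1}^{M_2} \cdots \sum_{y_n=1}^{M_n} \Pr(y_1|\mathbf{x},1) \Pr(y_2|\mathbf{x},2) \cdots \Pr(y_n|\mathbf{x},n) \left\| \mathbf{x} - \mathbf{x}'(y_1,y_2,\cdots,y_n) \right\|^2 \tag{56}$$

where $\Pr(y_k|\mathbf{x},k)$ denotes the posterior probability that (given input $\mathbf{x}$) code
index $y_k$ occurs in recognition model $k$. If $\mathbf{x}'(y_1,y_2,\cdots,y_n)$ is optimised (i.e.
takes the value that minimises $D_{VQ}$) then it becomes

$$\mathbf{x}'(y_1,y_2,\cdots,y_n) = \frac{\int d\mathbf{x}\, \Pr(\mathbf{x}) \Pr(y_1|\mathbf{x},1) \Pr(y_2|\mathbf{x},2) \cdots \Pr(y_n|\mathbf{x},n)\, \mathbf{x}}{\int d\mathbf{x}\, \Pr(\mathbf{x}) \Pr(y_1|\mathbf{x},1) \Pr(y_2|\mathbf{x},2) \cdots \Pr(y_n|\mathbf{x},n)} \tag{57}$$

The $\|\mathbf{x} - \mathbf{x}'(y_1,y_2,\cdots,y_n)\|^2$ term may be expanded thus (by adding and subtracting
$\frac{1}{n} \sum_{k=1}^{n} \mathbf{x}'_k(y_k)$)

$$\left\| \mathbf{x} - \mathbf{x}'(y_1,y_2,\cdots,y_n) \right\|^2 = \left\| \left( \mathbf{x} - \frac{1}{n} \sum_{k=1}^{n} \mathbf{x}'_k(y_k) \right) + \left( \frac{1}{n} \sum_{k=1}^{n} \mathbf{x}'_k(y_k) - \mathbf{x}'(y_1,y_2,\cdots,y_n) \right) \right\|^2 \tag{58}$$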

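Using these two results, together with Bayes' theorem, allows an upper bound
on $D_{VQ}$ to be derived as

$$D_{VQ} \le 2 \int d\mathbf{x}\, \Pr(\mathbf{x}) \sum_{y_1=1}^{M_1} \sum_{y_2=1}^{M_2} \cdots \sum_{y_n=1}^{M_n} \Pr(y_1|\mathbf{x},1) \Pr(y_2|\mathbf{x},2) \cdots \Pr(y_n|\mathbf{x},n) \left\| \mathbf{x} - \frac{1}{n} \sum_{k=1}^{n} \mathbf{x}'_k(y_k) \right\|^2 \tag{59}$$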



In this upper bound, the $\Pr(y_k|\mathbf{x},k)$ are used to produce soft encodings in
each of the recognition models ($k = 1, 2, \cdots, n$), then a sum $\frac{1}{n} \sum_{k=1}^{n} \mathbf{x}'_k(y_k)$ of
the vectors $\mathbf{x}'_k(y_k)$ is used as the reconstruction of the input $\mathbf{x}$. In the special
case where hard encodings are used, so that $\Pr(y_k|\mathbf{x},k) = \delta_{y_k,y_k(\mathbf{x})}$, the upper
bound on $D_{VQ}$ reduces to $D_{VQ} \le 2 \int d\mathbf{x}\, \Pr(\mathbf{x}) \left\| \mathbf{x} - \frac{1}{n} \sum_{k=1}^{n} \mathbf{x}'_k(y_k(\mathbf{x})) \right\|^2$.
Note that the code vectors used for the encoding operation $y_k(\mathbf{x})$ are not necessarily
the same as the $\mathbf{x}'_k(y_k)$, except in the special case $n = 1$.

Suppose that a single recognition model is independently used $n$ times, rather
than $n$ independent recognition models each independently being used once.
This corresponds to constraining the $P^{l+1}$ vectors and $P^{l|l+1}$ matrices to be the
same for each of the $n$ recognition models. The upper bound on $D_{VQ}$ can be
manipulated into the form

$$D_{VQ} \le \frac{2}{n} \int d\mathbf{x}\, \Pr(\mathbf{x}) \sum_{y=1}^{M} \Pr(y|\mathbf{x}) \left\| \mathbf{x} - \mathbf{x}'(y) \right\|^2 + \frac{2(n-1)}{n} \int d\mathbf{x}\, \Pr(\mathbf{x}) \left\| \mathbf{x} - \sum_{y=1}^{M} \Pr(y|\mathbf{x})\, \mathbf{x}'(y) \right\|^2 \tag{60}$$

where the $k$ index is no longer needed.



B.2 Full Bayesian Average Over Recognition Models 

One possible criticism of the recognition model given in equation 42 is that it is
a mixture of $K$ recognition models, where each contributing model is assigned
the same weight $\frac{1}{K}$. Normally, a posterior probability $\Pr(y|\mathbf{x})$ is decomposed
as a sum over posterior probabilities $\Pr(y|\mathbf{x},k)$ derived from each contributing
model, as follows

$$\Pr(y|\mathbf{x}) = \sum_{k=1}^{K} \Pr(y|\mathbf{x},k) \Pr(k|\mathbf{x}) \tag{61}$$

where each of the $K$ recognition models is assigned a different data-dependent
weight $\Pr(k|\mathbf{x})$. The conditional probabilities $\Pr(k|\mathbf{x})$ and $\Pr(y|\mathbf{x},k)$ can be evaluated
to yield

$$\begin{aligned}
\Pr(k|\mathbf{x}) &= \frac{\sum_{y=1}^{M} \Pr(\mathbf{x}|y,k) \Pr(y|k) \Pr(k)}{\sum_{k'=1}^{K} \sum_{y'=1}^{M} \Pr(\mathbf{x}|y',k') \Pr(y'|k') \Pr(k')} \\
\Pr(y|\mathbf{x},k) &= \frac{\Pr(\mathbf{x}|y,k) \Pr(y|k) \Pr(k)}{\sum_{y'=1}^{M} \Pr(\mathbf{x}|y',k) \Pr(y'|k) \Pr(k)}
\end{aligned} \tag{62}$$

so that

$$\Pr(y|\mathbf{x}) = \sum_{k=1}^{K} \frac{\Pr(\mathbf{x}|y,k) \Pr(y|k) \Pr(k)}{\sum_{k'=1}^{K} \sum_{y'=1}^{M} \Pr(\mathbf{x}|y',k') \Pr(y'|k') \Pr(k')} \tag{63}$$
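If the replacements $\Pr(k) \rightarrow 1$, $\Pr(y|k) \rightarrow A^{l+1}_{k,i_{l+1}} P^{l+1}_{i_{l+1}}$, $\Pr(\mathbf{x}|y,k) \rightarrow P^{l|l+1}_{i_l,i_{l+1}}$,
and $\Pr(y|\mathbf{x}) \rightarrow P^{l+1|l}_{i_{l+1},i_l}$ are made, then $\Pr(y|\mathbf{x})$ reduces to

$$P^{l+1|l}_{i_{l+1},i_l} = \sum_{k=1}^{K} \frac{P^{l|l+1}_{i_l,i_{l+1}} A^{l+1}_{k,i_{l+1}} P^{l+1}_{i_{l+1}}}{\sum_{k'=1}^{K} \sum_{i'_{l+1}=1}^{M_{l+1}} P^{l|l+1}_{i_l,i'_{l+1}} A^{l+1}_{k',i'_{l+1}} P^{l+1}_{i'_{l+1}}} \tag{64}$$

which is not the same as the PMD recognition model in equation 42. The
difference between equation 64 and equation 42 arises because the full Bayesian
approach in equation 64 ensures that the model index $k$ and the input $\mathbf{x}$ are
mutually dependent (via the factor $\Pr(k|\mathbf{x})$), whereas the PMD approach in
equation 42 ignores such dependencies.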

In the full Bayesian approach (see equation 64) the normalisation term in the
denominator has a double summation $\sum_{k'=1}^{K} \sum_{i'_{l+1}=1}^{M_{l+1}} P^{l|l+1}_{i_l,i'_{l+1}} A^{l+1}_{k',i'_{l+1}} P^{l+1}_{i'_{l+1}}$, which
involves all pairs of indices $k'$ and $i'_{l+1}$ with $A^{l+1}_{k',i'_{l+1}} > 0$, which thus corresponds
to long-range lateral interactions in layer $l+1$. On the other hand, in the PMD
approach (see equation 42) the normalisation term in the denominator has only
a single summation $\sum_{i'_{l+1}=1}^{M_{l+1}} P^{l|l+1}_{i_l,i'_{l+1}} A^{l+1}_{k,i'_{l+1}} P^{l+1}_{i'_{l+1}}$, so the lateral interactions in
layer $l+1$ are determined by the structure of the matrix $A^{l+1}_{k,i_{l+1}}$, which defines
only short-range lateral connections (i.e. for a given recognition model $k$, only
a limited number of index values $i_{l+1}$ satisfy $A^{l+1}_{k,i_{l+1}} > 0$).
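
The difference between the two approaches can be made concrete by mixing the same per-model posteriors with two different sets of weights: uniform weights $1/K$ (the PMD form of equation 42) and data-dependent weights proportional to each model's normalisation term (which reproduces equation 64 under the replacements listed above). The sketch below is illustrative only; the helper names and the numbers are hypothetical.

```python
def per_model_posteriors(likelihood, prior, A):
    """Posterior over layer-(l+1) indices within each recognition model k.

    Returns the list of per-model posteriors and the list of per-model
    normalisation terms sum_y likelihood[y] * A[k][y] * prior[y].
    """
    posts, norms = [], []
    for row in A:
        weights = [l * a * p for l, a, p in zip(likelihood, row, prior)]
        norm = sum(weights)
        posts.append([w / norm for w in weights])
        norms.append(norm)
    return posts, norms


def mix(posts, model_weights):
    """Weighted sum over recognition models of their posteriors."""
    size = len(posts[0])
    return [sum(w * post[y] for w, post in zip(model_weights, posts)) for y in range(size)]


likelihood = [0.1, 0.4, 0.4, 0.1]
prior = [0.25, 0.25, 0.25, 0.25]
A = [[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1]]

posts, norms = per_model_posteriors(likelihood, prior, A)
K = len(A)
pmd = mix(posts, [1.0 / K] * K)                         # equal weights
bayes = mix(posts, [z / sum(norms) for z in norms])     # data-dependent weights
print(pmd)
print(bayes)
```

Both results are normalised, but they differ whenever the recognition models explain the input unequally well, which is the dependence between the model index and the input that the PMD form ignores.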



C Comparison with the Helmholtz Machine 

In this appendix the relationship between two types of density model is discussed.
The first type is a conventional density model that approximates the
input probability density (i.e. the objective function is $L(P^0, Q^0)$), and the
second type is the one introduced here that approximates the joint probability
density of a Markov source (i.e. the objective function is $L(P, Q)$). In order to
relate $L(P^0, Q^0)$ to $L(P, Q)$ it is necessary to introduce additional layers (i.e.
layers $1, 2, \cdots, L$) into $L(P^0, Q^0)$ in an appropriate fashion.

The Helmholtz machine (HM) does this by replacing $L(P^0, Q^0)$ by a
different objective function (which has these additional layers present as hidden
variables), and which is an upper bound on the original objective function
$L(P^0, Q^0)$. It turns out that the Helmholtz machine objective function $D_{HM}$
and the Markov source objective function $L(P, Q)$ are closely related. The essential
difference between the two is that $D_{HM}$ does not include the cost of
specifying the state of layers $1, 2, \cdots, L$ given that the state of layer 0 is known,
which thus allows it to develop distributed codes (which are expensive to specify)
more easily.

In the conventional density modelling approach to neural networks, there
are two basic classes of model. In the case of both unsupervised and supervised
neural networks the source is $P^0$, which is the network input (unsupervised
case) or the network output (supervised case). Additionally, in the case of
supervised neural networks $P^0$ is conditioned on an additional network input as
$P^{0|\text{input}}$. Thus in both cases there is only an external source (i.e. source layers
$1, 2, \cdots, L$ are not present), which is modelled by $Q^0$ (unsupervised case) or
$Q^{0|\text{input}}$ (supervised case). $Q^0$ or $Q^{0|\text{input}}$ can be modelled in any way that is
convenient. Frequently a multilayer generative model of the form

$$Q^0_{i_0} = \sum_{i_1=1}^{M_1} \cdots \sum_{i_L=1}^{M_L} Q_{i_0,i_1,\cdots,i_L} \tag{65}$$

is used, where the $i_l$ (for $1 \le l \le L$) are hidden variables, which need to be
summed over in order to calculate the required marginal probability $Q^0_{i_0}$, and
the notation is deliberately chosen to be the same as is used in the Markov chain
model

$$Q_{i_0,i_1,\cdots,i_L} = Q^{0|1}_{i_0,i_1} Q^{1|2}_{i_1,i_2} \cdots Q^{L-1|L}_{i_{L-1},i_L} Q^L_{i_L} \tag{66}$$

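Helmholtz machines and Markov sources are related to each other. Thus the
$L(P^0, Q^0)$ that is minimised in conventional density modelling can be manipulated
in order to derive $D_{HM}$

$$\begin{aligned}
L(P^0,Q^0) &\le L(P^0,Q^0) + \sum_{i_0=1}^{M_0} P^0_{i_0} G_{i_0}(P^{1|0},Q^{1|0}) \\
&= -\sum_{i_0=1}^{M_0} \sum_{i_1=1}^{M_1} P^0_{i_0} P^{1|0}_{i_1,i_0} \log\left(Q^0_{i_0} Q^{1|0}_{i_1,i_0}\right) + \sum_{i_0=1}^{M_0} \sum_{i_1=1}^{M_1} P^0_{i_0} P^{1|0}_{i_1,i_0} \log P^{1|0}_{i_1,i_0} \\
&= L\left((P^0,P^{1|0}),(Q^0,Q^{1|0})\right) - \sum_{i_0=1}^{M_0} P^0_{i_0} H_{i_0}(P^{1|0}) \\
&= L(P,Q) - \sum_{i_0=1}^{M_0} P^0_{i_0} H_{i_0}(P^{1|0}) \\
&= D_{HM}
\end{aligned} \tag{67}$$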

The inequality $L(P^0, Q^0) \le D_{HM}$ follows from $G_{i_0}(P^{1|0}, Q^{1|0}) \ge 0$ (i.e. the
model $Q^{1|0}$ is imperfect, so that $Q^{1|0} \ne P^{1|0}$). The inequality $D_{HM} \le L(P, Q)$
follows from $H_{i_0}(P^{1|0}) \ge 0$ (i.e. the source $P^{1|0}$ is stochastic). If the model is
perfect ($Q^{1|0} = P^{1|0}$) and the source is deterministic ($P^{1|0}$ is such that the state
of layer 1 is known once the state of layer 0 is given), then these two inequalities
reduce to $L(P^0, Q^0) = L(P, Q)$.
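
The chain of inequalities can be checked on a small 2-layer example. The sketch below evaluates the three quantities directly from their definitions, taking the joint model to be the product of the generative model and the layer 1 model, as in the Markov chain form. The function name, the argument layout and the probabilities are hypothetical, and logarithms are to base 2.

```python
import math


def hm_quantities(p0, p10, q1, q01):
    """Return (L(P^0,Q^0), D_HM, L(P,Q)) for a 2-layer network.

    p0[i0] is the source, p10[i0][i1] the recognition model P^{1|0},
    q1[i1] the model of layer 1 and q01[i1][i0] the generative model Q^{0|1}.
    """
    n0, n1 = len(p0), len(q1)
    # Joint model probability of (i0, i1) and its marginal over i1.
    q_joint = [[q01[i1][i0] * q1[i1] for i1 in range(n1)] for i0 in range(n0)]
    q0 = [sum(row) for row in q_joint]
    l_input = -sum(p0[i0] * math.log2(q0[i0]) for i0 in range(n0))
    l_joint = -sum(
        p0[i0] * p10[i0][i1] * math.log2(q_joint[i0][i1])
        for i0 in range(n0) for i1 in range(n1) if p10[i0][i1] > 0
    )
    # Average entropy of the recognition model, sum_i0 P^0 H_i0(P^{1|0}).
    h_cond = -sum(
        p0[i0] * p10[i0][i1] * math.log2(p10[i0][i1])
        for i0 in range(n0) for i1 in range(n1) if p10[i0][i1] > 0
    )
    return l_input, l_joint - h_cond, l_joint


p0 = [0.6, 0.4]
p10 = [[0.9, 0.1], [0.2, 0.8]]   # rows indexed by i0
q1 = [0.5, 0.5]
q01 = [[0.8, 0.2], [0.3, 0.7]]   # rows indexed by i1
l_input, d_hm, l_joint = hm_quantities(p0, p10, q1, q01)
print(l_input, d_hm, l_joint)
```

The printed values are in increasing order. The gap between the first two is the average relative entropy between the recognition model and the model posterior, and the gap between the last two is the average entropy of the recognition model.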

The properties of the optimal codes that are used by a Helmholtz machine
when $D_{HM}$ is minimised may be investigated by writing the expression for $D_{HM}$
as a sum of two terms

$$D_{HM} = \sum_{i_0=1}^{M_0} P^0_{i_0} K_{i_0}(P^{1|0},Q^{0|1}) + \sum_{i_0=1}^{M_0} P^0_{i_0} G_{i_0}(P^{1|0},Q^1) \tag{68}$$

The $\sum_{i_0=1}^{M_0} P^0_{i_0} K_{i_0}(P^{1|0},Q^{0|1})$ part and the $\sum_{i_0=1}^{M_0} P^0_{i_0} G_{i_0}(P^{1|0},Q^1)$ part compete
with each other when $D_{HM}$ is minimised. Assuming that $P^0_{i_0} > 0$, the
$\sum_{i_0=1}^{M_0} P^0_{i_0} G_{i_0}(P^{1|0},Q^1)$ part likes to make $Q^1$ approximate $P^{1|0}$, which tends to
make $P^{1|0}$ behave like a distributed encoder. On the other hand, assuming that
$P^0_{i_0} > 0$, the $\sum_{i_0=1}^{M_0} P^0_{i_0} K_{i_0}(P^{1|0},Q^{0|1})$ part likes to make $Q^{0|1}$ approximate $P^{0|1}$,
which tends to make $P^{1|0}$ behave like a sparse encoder. The tension between
these two terms is optimally balanced when $D_{HM}$ is minimised.

The properties of the optimal codes that are used in the Markov source
approach when $L(P, Q)$ (see equation El) is minimised are different. The 2-layer
expression for $L(P, Q)$ is

$$L(P,Q) = \sum_{i_0=1}^{M_0} P^0_{i_0} K_{i_0}(P^{1|0},Q^{0|1}) + L(P^L,Q^L) \tag{69}$$

which contains the same sparse encoder term $\sum_{i_0=1}^{M_0} P^0_{i_0} K_{i_0}(P^{1|0},Q^{0|1})$ as
$D_{HM}$. However, the distributed encoder term $\sum_{i_0=1}^{M_0} P^0_{i_0} G_{i_0}(P^{1|0},Q^1)$ is missing,
and is replaced by $L(P^L, Q^L)$ which does not have the effect of encouraging
any particular type of code (other than one in which $P^L$ approximates $Q^L$).

These differences between $D_{HM}$ and $L(P, Q)$ show how the Markov source
approach encourages sparse codes to develop, whereas the Helmholtz machine
does not. It is not clear whether using $D_{HM}$ is the best approach to forming
distributed codes, because there are other ways of encouraging distributed codes
to develop, such as the factorial encoder discussed in section 5, which is based
on $L(P, Q)$ rather than $D_{HM}$.